NSF PAR Search | NSF Public Access Repository

Improving LLM Safety Alignment with Dual-Objective Optimization

Zhao, Xuandong; Cai, Will; Shi, Tianneng; Huang, David; Lin, Licong; Mei, Song; Song, Dawn (April 2026, ICML)

Full Text Available
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Cheng, Ziheng; Huang, Yixiao; Xu, Hui; Sojoudi, Somayeh; Zhao, Xuandong; Song, Dawn; Mei, Song (April 2026, NeurIPS)

Full Text Available
GuardAgent: safeguard LLM agents via knowledge-enabled reasoning

Xiang, Zhen; Zheng, Linzhi; Li, Yanjie; Hong, Junyuan; Li, Qinbin; Xie, Han; Zhang, Jiawei; Xiong, Zidi; Xie, Chulin; Yang, Carl; et al. (March 2026, ICML'25: Proceedings of the 42nd International Conference on Machine Learning)

Full Text Available
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Cheng, Ziheng; Huang, Yixiao; Xu, Hui; Sojoudi, Somayeh; Zhao, Xuandong; Song, Dawn; Mei, Song (September 2025, Conference on Neural Information Processing Systems)

Full Text Available
A Sustainable AI Economy Needs Data Deals That Work for Generators

Jia, Ruoxi; Oala, Luis; Xiong, Wenjie; Ge, Suqin; Wang, Jiachen T; Kang, Feiyang; Song, Dawn (August 2025, The Thirty-Ninth Annual Conference on Neural Information Processing Systems)

Full Text Available
HADES: Range-Filtered Private Aggregation on Public Data

Liu, Xiaoyuan; Trieu, Ni; Gupta, Trinabh; Ahmad, Ishtiyaque; Song, Dawn (July 2025, International Conference on Very Large Data Bases (VLDB) 2025)

Full Text Available
DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

https://doi.org/10.1109/SP61157.2025.00250

Liu, Yupei; Jia, Yuqi; Jia, Jinyuan; Song, Dawn; Gong, Neil Zhenqiang (May 2025, IEEE)

Full Text Available
Data Shapley in One Training Run

Wang, Jiachen T; Mittal, Prateek; Song, Dawn; Jia, Ruoxi (January 2025, International Conference on Learning Representations)

Full Text Available
AIR-BENCH 2024: A safety benchmark based on regulation and policies specified risk categories

Zeng, Yi; Yang, Yu; Zhou, Andy; Tan, Jeffrey Ziwei; Tu, Yuheng; Mai, Yifan; Klyman, Kevin; Pan, Minzhou; Jia, Ruoxi; Song, Dawn (April 2025, 13th International Conference on Learning Representations (ICLR 2025))

Foundation models (FMs) provide societal benefits but also amplify risks. Governments, companies, and researchers have proposed regulatory frameworks, acceptable use policies, and safety benchmarks in response. However, existing public benchmarks often define safety categories based on previous literature, intuitions, or common sense, leading to disjointed sets of categories for risks specified in recent regulations and policies, which makes it challenging to evaluate and compare FMs across these benchmarks. To bridge this gap, we introduce AIR-BENCH 2024, the first AI safety benchmark for language models aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in the AI risks taxonomy, AIR 2024. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-BENCH 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality. We evaluate leading language models on AIR-BENCH 2024, uncovering insights into their alignment with specified safety concerns. By bridging the gap between public benchmarks and practical AI risks, AIR-BENCH 2024 provides a foundation for assessing model safety across jurisdictions, fostering the development of safer and more responsible AI systems.
Full Text Available
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

https://doi.org/10.18653/v1/2024.emnlp-main.732

Zeng, Yi; Sun, Weiyu; Huynh, Tran; Song, Dawn; Li, Bo; Jia, Ruoxi (November 2024, Association for Computational Linguistics)

Full Text Available
